Systems Biology Lab
2024-11-01
Model selection is about determining what a good model is for one or more of these tasks, and about how to objectively compare candidate models.
| Term | Symbol | Definition |
|---|---|---|
| Degrees of freedom | \(\text{DF}\) | \(n - p\) |
| Residual sum of squares | \(\text{RSS}\) | \(\sum_{i=1}^n \left( y_i- f(\vec{x}_i,\vec{\beta}) \right)^2\) |
| Mean square error | \(\text{MSE}\) | \(\frac{1}{n} \text{RSS}\) |
| Residual standard error | \(\text{RSE}\) | \(\sqrt{\frac{1}{n - p}\text{RSS}}\) |
in which \(n\) is the number of measurements and \(p\) the number of parameters in the model.
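A minimal numeric sketch of the quantities in the table above, using made-up observations and predictions (here \(n = 5\) samples and \(p = 2\) parameters; all values are hypothetical):

```python
import math

y     = [2.1, 3.9, 6.2, 7.8, 10.1]   # observed responses y_i (hypothetical)
y_hat = [2.0, 4.0, 6.0, 8.0, 10.0]   # model predictions f(x_i, beta)
n, p = len(y), 2

rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
df  = n - p                                            # degrees of freedom
mse = rss / n                                          # mean square error
rse = math.sqrt(rss / (n - p))                         # residual standard error
print(rss, df, mse, rse)
```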
How the RSS depends on the number of predictors
The data set consists of 10 samples; the rest of the predictor columns are random vectors.
When the columns of the data matrix \(\boldsymbol{X}\) consist of 10 independent vectors (possibly just noise), then in the matrix equation
\[ \boldsymbol{X}\vec{\beta} = \vec{y} \] \(\boldsymbol{X}\) is an invertible \(10\times10\) matrix with rank 10, the response \(\vec{y}\) of length 10 lies in its column space, and \(\vec{\beta}\) of length 10 has an exact solution: the fit is perfect and the \(\text{RSS}\) is zero, even though the predictors are noise.
This contrasts with the case where the number of rows in \(\boldsymbol{X}\) is larger than the number of columns (predictors, terms in the model equation).
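This can be demonstrated directly. A sketch with hypothetical data: 10 samples, 10 pure-noise predictor columns, and an arbitrary response; the square system is solved exactly, so the RSS is zero up to rounding error:

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]            # partial pivoting
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

random.seed(1)
n = 10                                             # 10 samples, 10 predictors
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # pure noise
y = [random.gauss(0, 1) for _ in range(n)]         # arbitrary response
beta = solve(X, y)
fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
print(rss)                                         # ~0 up to rounding error
```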
The third assumption and the previous graphs give us clues about at least two ways to objectively compare the appropriateness of statistical models.
M1: \(\beta_0 + \beta_1 x\)
M2: \(\beta_0 + \beta_1 x + \beta_2 x^2\)
M3: \(\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
M4: \(\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\)
| Model | Parameters | MSE | \(R^2\) |
|---|---|---|---|
| M0 | 1 | 1169.2 | 0.0000 |
| M1 | 2 | 74.6 | 0.9362 |
| M2 | 3 | 13.7 | 0.9883 |
| M3 | 4 | 11.0 | 0.9906 |
| M4 | 5 | 10.5 | 0.9910 |
This shows that model M4, the model with the most parameters, is the best model, right?
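The pattern in the table is guaranteed: for nested least-squares models, adding a term can only decrease the training RSS, so training MSE and \(R^2\) always favour the largest model. A sketch with synthetic data (hypothetical, not the course data set):

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def poly_rss(xs, ys, degree):
    """Fit a polynomial by the normal equations; return the training RSS."""
    m = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    xty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    beta = solve(xtx, xty)
    fit = [sum(b * x ** i for i, b in enumerate(beta)) for x in xs]
    return sum((y - f) ** 2 for y, f in zip(ys, fit))

random.seed(2)
xs = [-1 + 0.2 * i for i in range(11)]              # 11 sample points
ys = [3 - 2 * x + 5 * x ** 2 + random.gauss(0, 1) for x in xs]  # quadratic truth
rss_by_degree = [poly_rss(xs, ys, d) for d in range(5)]         # M0 .. M4
print([round(r, 2) for r in rss_by_degree])          # non-increasing
```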
Even if the model family contains the underlying true relation, too complex models overfit: when repeatedly fitting to different training sets the fitted function varies a lot, i.e. small changes in the training set lead to large changes in the fitted function (parameter values).
Too simple models deviate systematically from the true relation: they cannot fit it.
\[\text{Adjusted }R^2 = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)}\]
where \(n\) is the number of measurements and \(p\) the number of parameters.
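As a quick numeric check of this formula (using \(n = 11\) and the RSS/TSS values that appear in the `anova()` outputs below for model M1):

```python
# Adjusted R^2 for M1 (y ~ x): n = 11 measurements, p = 2 parameters,
# RSS and TSS taken from the anova() outputs shown later in this section.
n, p = 11, 2
rss, tss = 820.47, 12861.7

adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
print(round(adj_r2, 3))   # 0.929, matching the comparison table
```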
\[\text{BIC} = p \ln(n) - 2 \ln(\hat{L})\]
where \(\hat{L}\) is the likelihood of the data (see Chapter 19) at the optimal parameter estimates of the model. When the errors are normally distributed this is, up to an additive constant, equal to
\[\text{BIC} \approx p \ln(n) + n \ln\left( \text{MSE} \right )\] A lower BIC is an indication of a better model.
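A sketch applying this approximation to models M1–M4, with \(n = 11\) and the residual sums of squares from the `anova()` outputs below. The absolute values differ from the comparison table (R's `BIC()` keeps additive constants that drop out here), but the differences between models, and hence the ranking, agree:

```python
import math

n = 11
rss    = {"M1": 820.47, "M2": 150.25, "M3": 121.27, "M4": 115.67}
params = {"M1": 2, "M2": 3, "M3": 4, "M4": 5}

# BIC approximation: p ln(n) + n ln(MSE), with MSE = RSS / n
bic = {m: params[m] * math.log(n) + n * math.log(rss[m] / n) for m in rss}
print(min(bic, key=bic.get))   # M2 has the lowest BIC
```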
| Model | \(R^2\) | MSE test set | Adj. \(R^2\) | BIC |
|---|---|---|---|---|
| M0 | 0 | 1040 | 0 | 113.7 |
| M1 | 0.9362 | 86.1 | 0.929 | 85.84 |
| M2 | 0.9883 | 21.6 | 0.985 | 69.57 |
| M3 | 0.9906 | 25.5 | 0.987 | 69.61 |
| M4 | 0.991 | 28.1 | 0.985 | 71.49 |
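Reading the table programmatically (values copied from the table above), each criterion picks its own "best" model; test-set MSE and BIC agree on M2, while adjusted \(R^2\) marginally prefers M3:

```python
test_mse = {"M0": 1040, "M1": 86.1, "M2": 21.6, "M3": 25.5, "M4": 28.1}
adj_r2   = {"M0": 0.0, "M1": 0.929, "M2": 0.985, "M3": 0.987, "M4": 0.985}
bic      = {"M0": 113.7, "M1": 85.84, "M2": 69.57, "M3": 69.61, "M4": 71.49}

print(min(test_mse, key=test_mse.get))  # M2: lowest MSE on the test set
print(max(adj_r2, key=adj_r2.get))      # M3: highest adjusted R^2 (barely)
print(min(bic, key=bic.get))            # M2: lowest BIC
```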
ANOVA (ANalysis Of VAriance) offers a statistical test of whether one model provides a significantly better fit than another (nested) model:
\[F = \frac{(\text{RSS}_1 - \text{RSS}_2)/(d_1 -d_2)}{\text{RSS}_2/d_2}\]
where \(d_1\) and \(d_2\) are the residual degrees of freedom of the two models. Under the null hypothesis that the extra parameters of Model 2 do not improve the fit, \(F \approx 1\).
Graphically, \(F\) is the ratio of the slope of the red line to the slope of the yellow line in the accompanying figure.
If \(F>1\) then the additional term and parameter in Model 2 contribute more to the reduction in RSS than expected by chance under the null hypothesis.
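As a sketch, the F statistic for comparing `y ~ 1` against `y ~ x`, computed from the RSS and residual degrees of freedom in the `anova()` output below:

```python
rss1, d1 = 12861.7, 10   # y ~ 1
rss2, d2 = 820.5, 9      # y ~ x

f = ((rss1 - rss2) / (d1 - d2)) / (rss2 / d2)
print(round(f, 2))   # 132.08, matching the anova() output
```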
```
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     10 12861.7
2      9   820.5  1     12041 132.08 1.111e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + I(x^2)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1      9 820.47
2      8 150.25  1    670.21 35.684 0.000333 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Demonstrates that the reduction in \(\text{RSS}\) by adding term \(x^2\) is significant.
```
Analysis of Variance Table

Model 1: y ~ x + I(x^2)
Model 2: y ~ x + I(x^2) + I(x^3)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      8 150.25
2      7 121.27  1    28.987 1.6732 0.2369
```
Demonstrates that the reduction in \(\text{RSS}\) by adding term \(x^3\) is not significant.
`anova()`: comparing all models

```
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + I(x^2)
Model 3: y ~ x + I(x^2) + I(x^3)
Model 4: y ~ x + I(x^2) + I(x^3) + I(x^4)
  Res.Df    RSS Df Sum of Sq       F   Pr(>F)
1      9 820.47
2      8 150.25  1    670.21 34.7658 0.001057 **
3      7 121.27  1     28.99  1.5036 0.266052
4      6 115.67  1      5.60  0.2904 0.609357
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
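Note that in a multi-model `anova()` call, R uses the residual mean square of the *largest* model as the denominator for every F value, which is why the F for adding \(x^2\) here (34.77) differs slightly from the pairwise comparison above (35.68). A sketch reproducing the F column from the RSS values in the output above:

```python
rss = [820.47, 150.25, 121.27, 115.67]   # M1 .. M4
df  = [9, 8, 7, 6]                        # residual degrees of freedom

denom = rss[-1] / df[-1]                  # residual mean square of M4
f_values = [(rss[i] - rss[i + 1]) / denom for i in range(3)]
print([round(f, 2) for f in f_values])    # ~ [34.77, 1.50, 0.29]
```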
Using ANOVA as a criterion to select the best model, we would choose M2.
How do we specify these models in `lm()`?

| age | weight | height | gender |
|---|---|---|---|
| 21 | 65.6 | 174.0 | M |
| 23 | 71.8 | 175.3 | M |
| 28 | 80.7 | 193.5 | M |
| 23 | 72.6 | 186.5 | M |
| 22 | 78.8 | 187.2 | M |
| 21 | 74.8 | 181.5 | M |
Response: weight \(w\)
Predictors: height \(h\) and gender \(g\) (an indicator with \(g = 0\) for females and \(g = 1\) for males)
\[w = \beta_0 + \beta_2 \, h\]
\[w = \underbrace{\beta_0 + \beta_1 \, g}_{\text{intercept}} + \beta_2 \, h\]
\[w = \underbrace{\beta_0 + \beta_1 \, g}_{\text{intercept}} + \underbrace{(\beta_2 + \beta_3 \, g)}_{\text{slope}} h\]
weight ~ height
weight ~ gender + height
weight ~ gender * height
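The three R formulas correspond to different design-matrix columns. A sketch (hypothetical values), assuming gender is dummy-coded as \(g = 0/1\):

```python
# One design-matrix row per observation, for each model formula:
def row_m1(h):    return [1, h]            # weight ~ height
def row_m2(g, h): return [1, g, h]         # weight ~ gender + height
def row_m3(g, h): return [1, g, h, g * h]  # weight ~ gender * height

print(row_m3(1, 180))   # [1, 1, 180, 180]: male, 180 cm
print(row_m3(0, 165))   # [1, 0, 165, 0]:   female, 165 cm
```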
\[ y = \beta_0 \]
weight ~ 1
This simply predicts the weight to be the average weight, and uses no predictors.
The \(\text{RSS}\) of this model is by definition equal to the \(\text{TSS}\) and equals \(9.01\times 10^{4}\). Its \(\text{MSE}\) equals \(178\).
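A quick check of these numbers, taking \(n = 507\) from the `anova()` output below (the Res.Df of the intercept model is \(n - 1 = 506\)):

```python
# Null model: RSS = TSS by definition, MSE = TSS / n.
n, tss = 507, 90123
mse = tss / n
print(round(mse))   # 178, as stated
```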
Model M3 has the formula \(w = \beta_0 + \beta_1 \, g + \beta_2 \, h + \beta_3 \, g h\).
```
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ height
Model 3: weight ~ gender + height
Model 4: weight ~ gender * height
  Res.Df   RSS Df Sum of Sq        F    Pr(>F)
1    506 90123
2    505 43753  1     46370 599.4181 < 2.2e-16 ***
3    504 39043  1      4710  60.8802 3.534e-14 ***
4    503 38912  1       132   1.7043    0.1923
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Model M2, with different intercepts but the same slope, is the best among our models according to this criterion.
\[w = \beta_0 + \beta_2 \, h + \beta_3 \, g h\] Let’s call this model M2’.
weight ~ height + height:gender
Compare it to M1, which is nested in M2’, and to M3, in which M2’ itself is nested.
This model has a common intercept for males and females but different slopes.
Perhaps a log-transformation of weight would help: the spread of the data above the fitted line seems larger than below it, so the residuals are probably not normally distributed.